Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier

Abstract

We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine's online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.
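The character-by-character processing described in the abstract can be illustrated with a minimal sketch. The special tokens, the vocabulary construction, and the function names below are assumptions made for illustration; the abstract does not specify how the authors built their vocabulary, and the transformer itself is not shown.

```python
from typing import Dict, List

# Hypothetical special tokens; the paper's actual choices are not stated here.
PAD, START, END = "<pad>", "<s>", "</s>"
SPECIALS = [PAD, START, END]


def build_vocab(strings: List[str]) -> Dict[str, int]:
    """Map every distinct character in the corpus to an integer id."""
    chars = sorted({ch for s in strings for ch in s})
    return {tok: i for i, tok in enumerate(SPECIALS + chars)}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Split text into single characters and wrap it in start/end markers."""
    return [vocab[START]] + [vocab[ch] for ch in text] + [vocab[END]]


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Turn ids back into a string, dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in SPECIALS)


# One InChI/IUPAC name pair (ethanol) standing in for the training data.
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
name = "ethanol"

source_vocab = build_vocab([inchi])  # encoder side: InChI characters
target_vocab = build_vocab([name])   # decoder side: IUPAC name characters

source_ids = encode(inchi, source_vocab)
target_ids = encode(name, target_vocab)
```

In this sketch each character becomes one token, so the sequence length equals the string length plus the two markers. A word or sub-word tokenizer would instead need a learned segmentation of both the InChI and the name.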

Similar resources

InChI, the IUPAC International Chemical Identifier

This paper documents the design, layout and algorithms of the IUPAC International Chemical Identifier, InChI.

InChI - the worldwide chemical structure identifier standard

Since its public introduction in 2005 the IUPAC InChI chemical structure identifier standard has become the international, worldwide standard for defined chemical structures. This article will describe the extensive use and dissemination of the InChI and InChIKey structure representations by and for the world-wide chemistry community, the chemical information community, and major publishers and...

Detection of IUPAC and IUPAC-like chemical names

MOTIVATION: Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals like SMILES, InChI, IUPAC or trivial names exist. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like ...

Translating Phrases in Neural Machine Translation

Phrases play an important role in natural language understanding and machine translation (Sag et al., 2002; Villavicencio et al., 2005). However, it is difficult to integrate them into current neural machine translation (NMT) which reads and generates sentences word by word. In this work, we propose a method to translate phrases in NMT by integrating a phrase memory storing target phrases from ...

Constructing a test to predict the translation performance of English translation MA graduates on legal correspondence and deeds as a profession

Given the ever-evolving and improving state of knowledge in the world, the need for worldwide communication is emerging more strongly than ever, which calls for special attention to the judgments and best choices involved in mediating between nations. As the language skills for translation are tested separately from translation skills themselves, to assess translation skills pro...

Journal

Journal title: Journal of Cheminformatics

Year: 2021

ISSN: 1758-2946

DOI: https://doi.org/10.1186/s13321-021-00535-x